NSF PAR Search | NSF Public Access Repository

Decoding Rewards in Competitive Games: Inverse Game Theory with Entropy Regularization

Liao, Junyi; Zhu, Zihan; Fang, Ethan X; Yang, Zhuoran; Tarokh, Vahid (August 2025, ICML Proceedings)

Estimating the unknown reward functions driving agents' behavior is a central challenge in inverse games and reinforcement learning. This paper introduces a unified framework for reward function recovery in two-player zero-sum matrix games and Markov games with entropy regularization. Given observed player strategies and actions, we aim to reconstruct the underlying reward functions. This task is challenging due to the inherent ambiguity of inverse problems, the non-uniqueness of feasible rewards, and limited observational data coverage. To address these challenges, we establish reward function identifiability using the quantal response equilibrium (QRE) under linear assumptions. Building on this theoretical foundation, we propose an algorithm to learn reward from observed actions, designed to capture all plausible reward parameters by constructing confidence sets. Our algorithm works in both static and dynamic settings and is adaptable to incorporate other methods, such as Maximum Likelihood Estimation (MLE). We provide strong theoretical guarantees for the reliability and sample-efficiency of our algorithm. Empirical results demonstrate the framework’s effectiveness in accurately recovering reward functions across various scenarios, offering new insights into decision-making in competitive environments.
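As a rough illustration of the quantal response equilibrium (QRE) mentioned in the abstract, the sketch below computes a QRE for a small entropy-regularized zero-sum matrix game and checks that the log-ratio of the row player's strategy is linear in the payoff matrix. Relations of this kind are what make reward recovery from observed strategies plausible. The payoff matrix `Q`, the regularization strength `eta`, the softmax form of the equilibrium, and the alternating fixed-point solver are illustrative assumptions, not the paper's algorithm.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def row_values(Q, nu):
    """Expected payoff of each row action against column strategy nu."""
    return [sum(q * p for q, p in zip(row, nu)) for row in Q]

def col_values(Q, mu):
    """Expected payoff (to the row player) of each column action against mu."""
    return [sum(Q[a][b] * mu[a] for a in range(len(Q))) for b in range(len(Q[0]))]

def solve_qre(Q, eta, iters=500):
    """Alternate softmax responses; assumes eta is small enough to converge."""
    mu = [1.0 / len(Q)] * len(Q)
    nu = [1.0 / len(Q[0])] * len(Q[0])
    for _ in range(iters):
        mu = softmax([eta * v for v in row_values(Q, nu)])   # row player maximizes
        nu = softmax([-eta * v for v in col_values(Q, mu)])  # column player minimizes
    return mu, nu

Q = [[2.0, -1.0], [-1.0, 1.0]]  # hypothetical payoff matrix
eta = 0.5                       # hypothetical regularization strength
mu, nu = solve_qre(Q, eta)

# At a QRE, log(mu[0] / mu[1]) equals eta * ((Q nu)[0] - (Q nu)[1]),
# which is a linear constraint on Q given the observed strategies.
values = row_values(Q, nu)
log_ratio = math.log(mu[0] / mu[1])
linear_form = eta * (values[0] - values[1])
print(mu, nu, log_ratio, linear_form)
```

The alternating update is only a convenient way to produce an equilibrium for this toy game; for larger `eta` or other payoffs it may need damping or a different solver.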
Full Text Available

In-Context Reinforcement Learning From Suboptimal Historical Data

Dong, Juncheng; Guo, Moyang; Fang, Ethan X; Yang, Zhuoran; Tarokh, Vahid (August 2025, ICML Proceedings)

Transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context reinforcement learning (ICRL). In this setting, we initially train a transformer on an offline dataset consisting of trajectories collected from various RL tasks, and then fix and use this transformer to create an action policy for new RL tasks. Notably, we consider the setting where the offline dataset contains trajectories sampled from suboptimal behavioral policies. In this case, standard autoregressive training corresponds to imitation learning and results in suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT) framework, which emulates the actor-critic algorithm in an in-context manner. In particular, we first train a transformer-based value function that estimates the advantage functions of the behavior policies that collected the suboptimal trajectories. Then we train a transformer-based policy via a weighted maximum likelihood estimation loss, where the weights are constructed based on the trained value function to steer the suboptimal policies to the optimal ones. We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.
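The weighted maximum likelihood step described in the abstract can be sketched as follows. This is a minimal illustration, assuming exponential advantage weights with a temperature and normalization by the weight sum; the abstract does not specify the weighting, so `weighted_nll`, `temperature`, and the sample values are hypothetical.

```python
import math

def weighted_nll(log_probs, advantages, temperature=1.0):
    """Advantage-weighted negative log-likelihood over logged actions.

    log_probs[i]  : log-probability the policy assigns to the i-th logged action
    advantages[i] : estimated advantage of that action under the behavior policy
    """
    # Exponential weights are an assumption; subtracting the max keeps exp() stable.
    top = max(advantages)
    weights = [math.exp((a - top) / temperature) for a in advantages]
    total = sum(weights)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / total

log_probs = [-1.0, -2.0]                         # hypothetical policy log-probabilities
uniform = weighted_nll(log_probs, [0.0, 0.0])    # equal advantages: plain mean NLL
skewed = weighted_nll(log_probs, [10.0, -10.0])  # first action strongly preferred
print(uniform, skewed)
```

With equal advantages the loss reduces to ordinary imitation learning; as the advantage gap grows, the loss is dominated by the actions the value estimate prefers.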
Full Text Available

In-Context Reinforcement Learning From Suboptimal Historical Data

Dong, Juncheng; Guo, Moyang; Fang, Ethan X; Yang, Zhuoran; Tarokh, Vahid (July 2025, 2025 International Conference on Machine Learning)

Full Text Available

What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

Zhang, Yufeng; Zhang, Fengzhuo; Yang, Zhuoran; Wang, Zhaoran (May 2025, Journal of medicinal and chemical sciences)

In-Context Learning (ICL) ability has been found efficient across a wide range of applications, where the Large Language Models (LLM) learn to complete the tasks from the examples in the prompt without tuning the parameters. In this work, we conduct a comprehensive study to understand ICL from a statistical perspective. First, we show that the perfectly pretrained LLMs perform Bayesian Model Averaging (BMA) for ICL under a dynamic model of examples in the prompt. The average error analysis for ICL is then built for the perfectly pretrained LLMs with the analysis of BMA. Second, we demonstrate how the attention structure boosts the BMA implementation. With sufficient examples in the prompt, attention is proven to perform BMA under the Gaussian linear ICL model, which also motivates the explicit construction of the hidden concepts from the attention heads' values. Finally, we analyze the pretraining behavior of LLMs. The pretraining error is decomposed as the generalization error and the approximation error. The generalization error is upper bounded via the PAC-Bayes framework. Then the ICL average error of the pretrained LLMs is shown to be the sum of O(T^{-1}) and the pretraining error. In addition, we analyze the ICL performance of the pretrained LLMs with misspecified examples.
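For readers unfamiliar with Bayesian Model Averaging, the toy sketch below shows the basic computation: a posterior over a few candidate hypotheses is updated from observed examples, and the prediction is the posterior-weighted average. The coin-bias hypotheses, the uniform prior, and the function name `bma_predict` are illustrative assumptions and are far simpler than the dynamic model analyzed in the paper.

```python
def bma_predict(observations, hypotheses, prior):
    """Posterior-weighted probability that the next binary outcome is 1."""
    posterior = list(prior)
    for y in observations:
        # Bayes update: multiply by each hypothesis's likelihood of y, then renormalize.
        posterior = [w * (h if y == 1 else 1.0 - h)
                     for w, h in zip(posterior, hypotheses)]
        total = sum(posterior)
        posterior = [w / total for w in posterior]
    prediction = sum(w * h for w, h in zip(posterior, hypotheses))
    return prediction, posterior

hypotheses = [0.2, 0.5, 0.8]     # hypothetical values of P(outcome = 1)
prior = [1.0 / 3.0] * 3          # uniform prior over the hypotheses
no_data, _ = bma_predict([], hypotheses, prior)
after_ones, posterior = bma_predict([1] * 10, hypotheses, prior)
print(no_data, after_ones, posterior)
```

With no examples the prediction is the prior average; after ten observed ones, nearly all posterior mass sits on the hypothesis 0.8 and the prediction moves toward it.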
Full Text Available

Contextual Dynamic Pricing with Strategic Buyers

https://doi.org/10.1080/01621459.2024.2370613

Liu, Pangpang; Yang, Zhuoran; Wang, Zhaoran; Sun, Will Wei (April 2025, Journal of the American Statistical Association)

Full Text Available

From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems

He, Jianliang; Chen, Siyu; Zhang, Fengzhuo; Yang, Zhuoran (July 2024, International Conference on Machine Learning 2024)

In this work, we theoretically investigate why large language model (LLM)-empowered agents can solve decision-making problems in the physical world. We consider a hierarchical reinforcement learning (RL) model where the LLM Planner handles high-level task planning and the Actor performs low-level execution. Within this model, the LLM Planner operates in a partially observable Markov decision process (POMDP), iteratively generating language-based subgoals through prompting. Assuming appropriate pretraining data, we prove that the pretrained LLM Planner effectively conducts Bayesian aggregated imitation learning (BAIL) via in-context learning. We also demonstrate the need for exploration beyond the subgoals produced by BAIL, showing that naively executing these subgoals results in linear regret. To address this, we propose an ε-greedy exploration strategy for BAIL, which we prove achieves sublinear regret when pretraining error is low. Finally, we extend our theoretical framework to cases where the LLM Planner acts as a world model to infer the environment’s transition model and to multi-agent settings, facilitating coordination among multiple Actors.
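The ε-greedy idea in the abstract can be sketched in a few lines. This is a minimal illustration, assuming that exploration means picking a subgoal uniformly at random from a fixed candidate set; the function `choose_subgoal`, the candidate subgoals, and the uniform exploration rule are hypothetical and not taken from the paper.

```python
import random

def choose_subgoal(planner_subgoal, candidate_subgoals, epsilon, rng):
    """With probability epsilon explore uniformly; otherwise follow the planner."""
    if rng.random() < epsilon:
        return rng.choice(candidate_subgoals)
    return planner_subgoal

rng = random.Random(0)  # seeded for reproducibility
candidates = ["open door", "pick up key", "go to exit"]  # hypothetical subgoals

# epsilon = 0 always follows the planner; epsilon = 0.3 sometimes explores.
greedy = [choose_subgoal("pick up key", candidates, 0.0, rng) for _ in range(100)]
mixed = [choose_subgoal("pick up key", candidates, 0.3, rng) for _ in range(1000)]
print(sum(m != "pick up key" for m in mixed), "exploratory picks out of", len(mixed))
```

Setting `epsilon` to zero recovers the naive execution of planner subgoals that the abstract shows can incur linear regret.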
Full Text Available

Online Performative Gradient Descent for Learning Nash Equilibria in Decision-Dependent Games

Zhu, Zihan; Fang, Ethan X; Yang, Zhuoran (December 2023, Conference on Neural Information Processing Systems)

Full Text Available

Joint Differentiable Optimization and Verification for Certified Reinforcement Learning

https://doi.org/10.1145/3576841.3585919

Wang, Yixuan; Zhan, Simon; Wang, Zhilu; Huang, Chao; Wang, Zhaoran; Yang, Zhuoran; Zhu, Qi (May 2023, ACM)

Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments

Wang, Yixuan; Zhan, Simon Sinong; Jiao, Ruochen; Wang, Zhilu; Jin, Wanxin; Yang, Zhuoran; Wang, Zhaoran; Huang, Chao; Zhu, Qi (July 2023, 40th International Conference on Machine Learning (ICML’23))

Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium

https://doi.org/10.1287/moor.2022.1268

Xie, Qiaomin; Chen, Yudong; Wang, Zhaoran; Yang, Zhuoran (June 2022, Mathematics of Operations Research)

We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games where the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problems are considered. In the offline setting, we control both players and aim to find the Nash equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an [Formula: see text] upper bound on the duality gap and regret, where d is the linear dimension, H the horizon and T the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turn-based Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function, and then compute the optimistic policy by solving a general-sum matrix game with these bounds as the payoff matrices. As finding the Nash equilibrium of a general-sum game is computationally hard, our algorithm instead solves for a coarse correlated equilibrium (CCE), which can be obtained efficiently. To our best knowledge, such a CCE-based scheme for optimism has not appeared in the literature and might be of interest in its own right.
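To make the coarse correlated equilibrium (CCE) condition concrete, the sketch below checks whether a joint distribution over action pairs is a CCE of a two-player matrix game: neither player may gain by committing to a fixed action while the other follows their marginal. It only verifies the condition and does not compute a CCE; the function `is_cce` and the matching-pennies payoffs are illustrative assumptions, not the paper's construction from confidence bounds.

```python
def is_cce(joint, payoff_row, payoff_col, tol=1e-9):
    """Return True if `joint` is a coarse correlated equilibrium.

    joint[a][b] is the probability of the action pair (a, b);
    payoff_row and payoff_col give each player's payoff for that pair.
    """
    n_rows, n_cols = len(joint), len(joint[0])
    row_value = sum(joint[a][b] * payoff_row[a][b]
                    for a in range(n_rows) for b in range(n_cols))
    col_value = sum(joint[a][b] * payoff_col[a][b]
                    for a in range(n_rows) for b in range(n_cols))
    row_marginal = [sum(joint[a]) for a in range(n_rows)]
    col_marginal = [sum(joint[a][b] for a in range(n_rows)) for b in range(n_cols)]

    # Row player deviates to a fixed action against the column marginal.
    for a_dev in range(n_rows):
        dev = sum(col_marginal[b] * payoff_row[a_dev][b] for b in range(n_cols))
        if dev > row_value + tol:
            return False
    # Column player deviates to a fixed action against the row marginal.
    for b_dev in range(n_cols):
        dev = sum(row_marginal[a] * payoff_col[a][b_dev] for a in range(n_rows))
        if dev > col_value + tol:
            return False
    return True

payoff_row = [[1.0, -1.0], [-1.0, 1.0]]            # matching pennies (hypothetical example)
payoff_col = [[-v for v in row] for row in payoff_row]
uniform = [[0.25, 0.25], [0.25, 0.25]]
point_mass = [[1.0, 0.0], [0.0, 0.0]]
print(is_cce(uniform, payoff_row, payoff_col), is_cce(point_mass, payoff_row, payoff_col))
```

The uniform joint distribution passes the check, while the point mass on the first action pair fails because the column player gains by switching actions.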
Full Text Available
